docs(infra): infra/README.md — bootstrap runbook (PR 5 of Addison's plan)#4901
Conversation
… add-host PR 5 (final) of Addison's NixOS-AI-cluster bootstrap plan. infra/README.md is the human entry point: tree diagram, bootstrap runbook (4 steps from ISO to running cluster), bootstrap order (9 steps from control-plane boot to self-managing cluster), add-a- workload flow, add-a-host flow, update procedures, secrets posture, devShell usage. The optional scripts/build-usb.sh from the original plan is skipped per Rule 0 (no .sh outside tools/setup/). The one-liner equivalent (`nix build .#installer-iso` + `sudo dd`) is documented in the README's "Build the installer ISO" section. This completes the file tree Addison enumerated: ✓ flake.nix (PR #4898) ✓ flake.lock (generated by `nix flake update`; not authored) ✓ .gitignore additions (PR #4898) ✓ infra/nixos/modules/{common,k3s-server,k3s-agent,gpu}.nix (PR #4898) ✓ infra/nixos/hosts/installer/configuration.nix (PR #4897) ✓ infra/nixos/hosts/control-plane/ (PR #4899) ✓ infra/nixos/hosts/worker-gpu-01/ (PR #4899) ✓ infra/nixos/hosts/worker-gpu-02/ (PR #4899) ✓ infra/k8s/bootstrap/{argocd-namespace,argocd-install,initial-orleans}.yaml (PR #4900) ✓ infra/k8s/applications/root-application.yaml (PR #4900) ✓ infra/k8s/applications/orleans/{Application,deployment,service,rbac,configmap}.yaml (PR #4900) ✓ infra/k8s/applications/{gitlab,argoworkflows,argorollouts}/Application.yaml (PR #4900) ✓ infra/README.md (this PR) ⨯ scripts/build-usb.sh (skipped — Rule 0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2495c2c
into
feat/addison-flake-and-modules-2026-05-24
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 887d1d6ea9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 1. **Control-plane boots** → K3S server starts with embedded etcd | ||
| 2. K3S applies `infra/k8s/bootstrap/argocd-namespace.yaml` | ||
| 3. K3S applies `infra/k8s/bootstrap/argocd-install.yaml` → ArgoCD pods come up | ||
| 4. K3S applies `infra/k8s/bootstrap/initial-orleans.yaml` → Orleans namespace + skeleton StatefulSet |
There was a problem hiding this comment.
Remove nonexistent bootstrap step for initial-orleans
This step documents infra/k8s/bootstrap/initial-orleans.yaml as a K3S first-boot auto-apply, but services.k3s.manifests in infra/nixos/modules/k3s-server.nix only registers argocd-namespace, argocd-install, and root-application. That mismatch makes the runbook inaccurate during bring-up and can cause operators to troubleshoot for a bootstrap manifest that is never actually applied by K3S.
Useful? React with 👍 / 👎.
| Bump the `targetRevision` in the corresponding `Application.yaml` and | ||
| commit. ArgoCD reconciles automatically. |
There was a problem hiding this comment.
Fix ArgoCD upgrade instructions to point at real source
The update guidance says to bump targetRevision in the corresponding Application.yaml for ArgoCD, but ArgoCD in this repo is pinned via infra/k8s/bootstrap/argocd-install.yaml (remote manifest tag), and there is no ArgoCD Application.yaml to edit. As written, the documented ArgoCD upgrade path is not executable and will send maintainers to the wrong file.
Useful? React with 👍 / 👎.
#4898) * feat(infra): flake.nix + shared NixOS modules (common, k3s-server, k3s-agent, gpu) Wires the installer config from PR #4897 into a buildable flake and seeds the shared modules every cluster host will import. flake.nix: - nixosConfigurations.installer → builds the USB ISO that PR #4897 declared - packages.installer-iso convenience alias - devShells.default with cluster admin toolkit - nixosModules.{common,k3s-server,k3s-agent,gpu} for downstream per-host configs infra/nixos/modules/common.nix: - Nix + flakes settings (cache, GC, trusted users) - Locale + time defaults - NetworkManager + firewall ON - SSH key-only (no PermitRootLogin password, no PasswordAuthentication) - `zeta` admin user (no initialPassword) - Baseline package set (git/vim/htop/kubectl/k9s/etc) - systemd-boot UEFI - powerManagement.cpuFreqGovernor = "performance" (AI workloads) infra/nixos/modules/k3s-server.nix: - role=server with embedded etcd (clusterInit=true) - Disables bundled servicelb + traefik (ArgoCD will land replacements) - Auto-applies k8s/bootstrap/* manifests on first boot so ArgoCD self-installs and immediately starts reconciling root-application - Firewall opens 6443/10250/2379/2380 + 8472/udp for flannel - KUBECONFIG env baked in infra/nixos/modules/k3s-agent.nix: - role=agent joins via serverAddr + tokenFile - Node label zeta.io/role=worker for placement - Firewall opens 10250 + 8472/udp infra/nixos/modules/gpu.nix: - NVIDIA driver (production branch) + container toolkit - allowUnfreePredicate scoped to nvidia + cuda packages only - nvidia-persistenced enabled (avoids first-pod cold-start tax) - Node label zeta.io/gpu=nvidia for `nvidia.com/gpu` pod requests - nvtop + cudart + nvcc on the host for diagnostics .gitignore additions: - result, result-* (nix build outputs) - .direnv/, .envrc.local (worker-shell flake integration) - .nix-eval-cache/ - /hardware-configuration.nix (top-level only; per-host configs keep theirs under infra/nixos/hosts/<host>/) Tokens are placeholder-pathed (tokenFile = /var/lib/rancher/k3s/.../token) so plaintext secrets never land in Git. sops-nix or agenix wiring lands in a follow-up PR alongside the per-host configs that need real tokens. Bootstrap manifests referenced from k3s-server.nix (k8s/bootstrap/*) land in PR 3; until then the manifests reference resolves to a not-yet-existent path, which is fine because no host imports k3s-server.nix yet (per-host configs land in PR 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(infra): per-host configs (control-plane + worker-gpu-01/02) (#4899) PR 3 of Addison's NixOS-AI-cluster bootstrap plan. Adds three host configs that compose the shared modules from PR #4898: infra/nixos/hosts/control-plane/: - configuration.nix: imports common + k3s-server - hardware-configuration.nix: placeholder (replaced during install by `nixos-generate-config --root /mnt`) - README.md: install runbook, post-install verification commands infra/nixos/hosts/worker-gpu-01/: - configuration.nix: imports common + k3s-agent + gpu - serverAddr points at control-plane.zeta.local:6443 - hardware-configuration.nix: placeholder infra/nixos/hosts/worker-gpu-02/: - identical shape to worker-gpu-01 (separate file so per-machine labels / hardware specifics declare per host) flake.nix: - nixosConfigurations now exposes control-plane, worker-gpu-01, worker-gpu-02 alongside installer Placeholder hardware-configuration.nix files ship with minimal valid stubs (not-detected.nix import + DHCP + ext4 by-label devices) so `nix flake check` succeeds in CI. Each comment block names the generator command that replaces them during real install. Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(infra): k8s bootstrap + ArgoCD App-of-Apps (orleans, gitlab, argo-workflows, argo-rollouts) (#4900) PR 4 of Addison's NixOS-AI-cluster bootstrap plan. Lands the Kubernetes substrate that K3S auto-applies and ArgoCD then takes over reconciling. infra/k8s/bootstrap/ (K3S auto-applies on first boot via services.k3s.manifests in k3s-server.nix): - argocd-namespace.yaml - argocd-install.yaml: kustomize ref to ArgoCD v2.13.2 upstream manifest (pinned for reproducibility) - initial-orleans.yaml: minimal Orleans bootstrap StatefulSet scaled to replicas: 0 until a real silo image is published. Includes namespace, ServiceAccount, Role+RoleBinding for Kubernetes-clustering pod/endpoint discovery, headless service, client gateway service. infra/k8s/applications/ (ArgoCD watches this dir recursively): - root-application.yaml: App-of-Apps root; auto-applied by K3S. Selects Application.yaml at any depth via include glob. - orleans/Application.yaml: ArgoCD-managed Orleans, supersedes the bootstrap StatefulSet once reconcile completes - orleans/{deployment,service,rbac,configmap}.yaml: full Orleans StatefulSet (replicas: 0 placeholder), headless silo + client + dashboard services, RBAC, cluster config - gitlab/Application.yaml: GitLab CE Helm chart with bundled cert-manager/nginx/prometheus DISABLED (cluster has its own) and runners enabled for in-cluster CI - argoworkflows/Application.yaml: Argo Workflows 3.6 family; 7-day workflow retention; parallelism 50 - argorollouts/Application.yaml: Argo Rollouts 1.8 family with dashboard enabled for canary/blue-green inspection Add-a-workload-to-the-cluster flow: 1. mkdir infra/k8s/applications/<name>/ 2. write Application.yaml + supporting manifests 3. git commit + push to main 4. ArgoCD picks it up on next sync (~3 min) 5. K3S applies it The flake IS the tick source. The cluster reconciles toward it. Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(infra): add infra/README.md — bootstrap runbook + add-workload + add-host (#4901) PR 5 (final) of Addison's NixOS-AI-cluster bootstrap plan. infra/README.md is the human entry point: tree diagram, bootstrap runbook (4 steps from ISO to running cluster), bootstrap order (9 steps from control-plane boot to self-managing cluster), add-a- workload flow, add-a-host flow, update procedures, secrets posture, devShell usage. The optional scripts/build-usb.sh from the original plan is skipped per Rule 0 (no .sh outside tools/setup/). The one-liner equivalent (`nix build .#installer-iso` + `sudo dd`) is documented in the README's "Build the installer ISO" section. This completes the file tree Addison enumerated: ✓ flake.nix (PR #4898) ✓ flake.lock (generated by `nix flake update`; not authored) ✓ .gitignore additions (PR #4898) ✓ infra/nixos/modules/{common,k3s-server,k3s-agent,gpu}.nix (PR #4898) ✓ infra/nixos/hosts/installer/configuration.nix (PR #4897) ✓ infra/nixos/hosts/control-plane/ (PR #4899) ✓ infra/nixos/hosts/worker-gpu-01/ (PR #4899) ✓ infra/nixos/hosts/worker-gpu-02/ (PR #4899) ✓ infra/k8s/bootstrap/{argocd-namespace,argocd-install,initial-orleans}.yaml (PR #4900) ✓ infra/k8s/applications/root-application.yaml (PR #4900) ✓ infra/k8s/applications/orleans/{Application,deployment,service,rbac,configmap}.yaml (PR #4900) ✓ infra/k8s/applications/{gitlab,argoworkflows,argorollouts}/Application.yaml (PR #4900) ✓ infra/README.md (this PR) ⨯ scripts/build-usb.sh (skipped — Rule 0) Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): address Copilot P0/P1 review on PR #4898 7 unresolved threads from Copilot — fixed the real issues, marking the rest as outdated. P0 fixes: k3s-server.nix manifests path (line 55): `../../../k8s/` resolves to `<repo-root>/k8s/` (one level above `infra/`), which doesn't exist. Module at `infra/nixos/modules/` needs `../../k8s/` to reach `infra/k8s/`. Fixed all 3 references (argocd-namespace, argocd-install, root-application). k3s-server.nix clusterInit (line 31): unconditional `true` is wrong for multi-server HA — only the first control-plane node should set clusterInit; additional servers join via serverAddr. Changed to `lib.mkDefault true` and documented the per-host override pattern for HA expansion. P1 fixes: k3s-server.nix kubeconfig mode (line 40): `--write-kubeconfig-mode=0644` makes the admin kubeconfig world-readable, leaking cluster-admin creds to any unprivileged user on the control-plane node. Changed to 0640 + `--write-kubeconfig-group=wheel` so the wheel group can use kubectl without sudo, but other users can't read it. flake.nix supportedSystems / installer-iso (line 75): the installer NixOS config is x86_64-linux only, but the package was published on all `supportedSystems` including aarch64-linux, which would fail evaluation. Gated with `nixpkgs.lib.optionalAttrs (system == "x86_64-linux")` so `packages.aarch64-linux` is empty and `packages.x86_64-linux.installer-iso` resolves cleanly. devShell + formatter remain on all systems. Marked outdated (no fix needed): flake.nix line 20 (Copilot reviewed before PR #4900 landed): k8s/ directory now exists post-stack-merge. flake.nix line 80 (Copilot reviewed before PR #4899 landed): "Future hosts land in PR 2" comment removed by per-host PR. Not fixed in this commit: flake.lock (Copilot P1 line 5): requires `nix flake update` on a machine with Nix installed; not present on the autonomous agent's workstation. First maintainer with Nix runs the update and commits the resulting lock file as a follow-up — that commit is byte-stable and reviewable in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): address 2nd Copilot review wave (P0 port, README, comments) Second batch of Copilot review (8 new threads in addition to the 7 from the first wave already fixed in the prior commit). P0 fix: k3s-server.nix firewall (line 77): missing port 9345/TCP — the K3S supervisor/registration port. Without it, agents cannot complete the join handshake and additional server nodes can't participate in HA. Added with explanatory comment. P1 fixes: infra/README.md serverAddr scheme (line 92): documented as `control-plane.zeta.local:6443` without `https://`. NixOS `services.k3s.serverAddr` requires the scheme. Updated to show the full URL and named the validation constraint. infra/README.md secrets section (line 125): only documented the server-role token path. Agent-role token path (/var/lib/rancher/k3s/agent/token) was missing. Added both, plus the openssl generation one-liner and the "same value to all nodes" requirement. infra/k8s/bootstrap/initial-orleans.yaml header (line 6): contained named attribution + anthropomorphic content. Per codebase convention for current-state infra manifests, rewritten as factual scope description. P2 fixes: infra/nixos/modules/gpu.nix open-modules comment (line 51): said "Open-source kernel modules — works on RTX 20-series and newer" but default was `false` (proprietary). Comment now matches the default (proprietary chosen for hardware compatibility) and names the per-host override for newer-only nodes. infra/k8s/bootstrap/argocd-install.yaml ArgoCD version comment (line 25): conflated ArgoCD-version pinning with targetRevision-of-tracked-Git-ref. Rewrote to separate the two upgrade vectors clearly. Already-resolved-by-prior-commit (no fix needed in this commit): k3s-server.nix line 49 P0 (write-kubeconfig-mode 0644): same finding as line 54 from the first wave; the prior commit already changed to 0640 + group=wheel. flake.nix line 53 P1 (aarch64-linux): same finding as the line 75 thread from the first wave; the prior commit already gated installer-iso to x86_64-linux only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): markdownlint MD007 — nested list indent in README CI surfaced two MD007/ul-indent errors at infra/README.md lines 126-127: nested unordered list items used 4-space indent (expected 2-space). The nested list was the per-role K3S token path enumeration added in the prior 2nd-wave fix commit (PR #4898 Copilot review thread PRRT_kwDOSF9kNM6EcQFw). Restructured to fix the indent and add a blank line between the nested list and the continuation paragraph so markdownlint sees the structure cleanly. Verified locally: `npx markdownlint-cli2 infra/README.md` now returns 0 errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): 3rd Copilot review wave (manifest, .example refs, framing) 7 new threads after the prior fix-batch CI pass. P1 fixes: k3s-server.nix manifests + README claim: README documented K3S applying initial-orleans.yaml, but it wasn't in the manifests list. ADD initial-orleans.source to manifests (the right fix — Orleans namespace + RBAC + skeleton StatefulSet should be bootstrapped alongside ArgoCD). Updated the surrounding comment to reflect the 4-manifest seed (argocd-namespace, argocd-install, initial-orleans, root-application). k3s-server.nix servicelb/traefik comment (line 52): said "ArgoCD will install MetalLB + ingress-nginx as Applications" but no such Applications exist under infra/k8s/applications/. Reworded to name the bootstrap-period gap (LoadBalancer Services stay Pending; use NodePort or host-network during bootstrap). control-plane/README.md (line 58): referenced hardware-configuration.nix.example which was removed in the fix-up that gave each host a real placeholder. Replaced with current-state description (placeholder content + generator command to replace it on real install). worker-gpu-01/configuration.nix import comment (line 12): same hardware-configuration.nix.example reference. Updated to match current placeholder + generator command pattern. P2 fix: infra/README.md "The framing" section (line 148): contained named attribution ("Per Addison's spec") in a current-state infra doc. Per codebase convention for current-state surfaces (vs. history/roster surfaces which exempt attribution), reworded as factual design statement; kept the substrate intent (declarative desired state, drift reconciliation, single source of truth) without the attribution. Resolved as outdated (no fix needed): flake.nix line 90 (Copilot reading the PR DESCRIPTION which said per-host configs land in PR 3 — the stack collapsed and this PR contains them; description is historical and stale, the code is correct). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Lior <lior@zeta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
PR 5 (final) of Addison's NixOS-AI-cluster bootstrap plan.
Base: #4898 (stacked on the full infrastructure tree).
Content
infra/README.md— the human entry point for the cluster bootstrap:infra/What's NOT in this PR
scripts/build-usb.sh— skipped per Rule 0 (no.shoutsidetools/setup/). The one-liner equivalent (nix build .#installer-iso+sudo dd) is documented in the README's "Build the installer ISO" section.Plan completion
This PR closes the file enumeration from Addison's spec:
flake.nixflake.locknix flake update; not authored).gitignoreadditionsinfra/nixos/modules/{common,k3s-server,k3s-agent,gpu}.nixinfra/nixos/hosts/installer/configuration.nixinfra/nixos/hosts/control-plane/infra/nixos/hosts/worker-gpu-01/infra/nixos/hosts/worker-gpu-02/infra/k8s/bootstrap/*infra/k8s/applications/root-application.yamlinfra/k8s/applications/orleans/*infra/k8s/applications/{gitlab,argoworkflows,argorollouts}/Application.yamlinfra/README.mdscripts/build-usb.shTest plan
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com